Abstract

We introduce a new protocol for prediction with expert advice in which 
each expert evaluates the learner's and his own performance using a loss 
function that may change over time and may be different from the loss 
functions used by the other experts. The learner's goal is to perform bet- 
ter or not much worse than each expert, as evaluated by that expert, for 
all experts simultaneously. If the loss functions used by the experts are all 
proper scoring rules and all mixable, we show that the defensive forecast- 
ing algorithm enjoys the same performance guarantee as that attainable 
by the Aggregating Algorithm in the standard setting and known to be 
optimal. This result is also applied to the case of "specialist" (or "sleep- 
ing") experts. In this case, the defensive forecasting algorithm reduces to 
a simple modification of the Aggregating Algorithm. 

1 Introduction 

We consider the problem of online sequence prediction. A process generates
outcomes $\omega_1, \omega_2, \ldots$ step by step. At each step $t$, a learner tries to guess the
next outcome by announcing his prediction $\gamma_t$. Then the actual outcome $\omega_t$ is
revealed. The quality of the learner's prediction is measured by a loss function:
the learner's loss at step $t$ is $\lambda(\gamma_t, \omega_t)$.

Prediction with expert advice is a framework that does not make any
assumptions about the generating process. The performance of the learner is
compared to the performance of several other predictors called experts. At each
step, each expert gives his prediction $\gamma_t^n$, then the learner produces his own
prediction $\gamma_t$ (possibly based on the experts' predictions at the last step and
the experts' predictions and outcomes at all the previous steps), and the
accumulated losses are updated for the learner and for the experts. There are many
algorithms for the learner in this framework; for a review, see [3].

In practical applications of the algorithms for prediction with expert advice, 
choosing the loss function is often a problem. The task may have no natural 
measure of loss, except the vague concept that the closer the prediction to the 
outcome the better. Thus one can select among several common loss functions, 
for example, the quadratic loss (reflecting the idea of least squares methods) or 



the logarithmic loss (which has an information theory background). A similar 
issue arises when experts themselves are prediction algorithms that optimize 
some losses internally. Then it is unfair to these experts when the learner 
competes with them according to a "foreign" loss function. 

This paper introduces a new version of the framework of prediction with 
expert advice where there is no single fixed loss function but some loss function 
is linked to every expert. The performance of the learner is compared to the 
performance of each expert according to the loss function linked to that expert. 
Informally speaking, each expert has to be convinced that the learner performs 
almost as well as, or better than, that expert himself. 

We prove that a known algorithm for the learner, the defensive forecasting
algorithm [4], can be applied in the new setting and gives the same performance
guarantee as that attainable in the standard setting, provided all loss functions
are proper scoring rules.

Another framework to which our methods can be fruitfully applied is that
of "specialist experts": see, e.g., [S], [T], and [IT]. We generalize some of the
known results in the case of mixable loss functions.

To keep presentation as simple as possible, we restrict ourselves to binary 
outcomes {0,1}, predictions from [0,1], and a finite number of experts. We 
formulate our results for mixable loss functions only. However, these results can 
be easily transferred to more general settings (non-binary outcomes, arbitrary 
prediction spaces, countably many experts, second-guessing experts, etc.) where 
the methods of [4] work. 

2 Prediction with simple experts' advice 

In this preliminary section we recall the standard protocol of prediction with 
expert advice and some known results. 

Let $\{0, 1\}$ be the set of possible outcomes $\omega$, $[0, 1]$ be the set of possible
predictions $\gamma$, and $\lambda : [0, 1] \times \{0, 1\} \to [0, \infty]$ be the loss function. The loss function
$\lambda$ and parameter $N$ (the number of experts) specify the game of prediction with
expert advice. The game is played by Learner, Reality, and $N$ experts, Expert
1 to Expert $N$, according to the following protocol.

Prediction with expert advice

  $L_0 := 0$.
  $L_0^n := 0$, $n = 1, \ldots, N$.
  FOR $t = 1, 2, \ldots$:
    Expert $n$ announces $\gamma_t^n \in [0, 1]$, $n = 1, \ldots, N$.
    Learner announces $\gamma_t \in [0, 1]$.
    Reality announces $\omega_t \in \{0, 1\}$.
    $L_t := L_{t-1} + \lambda(\gamma_t, \omega_t)$.
    $L_t^n := L_{t-1}^n + \lambda(\gamma_t^n, \omega_t)$, $n = 1, \ldots, N$.
  END FOR

The goal of Learner is to keep his loss $L_t$ smaller or at least not much greater
than the loss $L_t^n$ of Expert $n$, at each step $t$ and for all $n = 1, \ldots, N$.

We only consider loss functions that have the following properties:

Assumption 1: $\lambda(\gamma, 0)$ and $\lambda(\gamma, 1)$ are continuous in $\gamma \in [0, 1]$ for the
standard (Aleksandrov's) topology on $[0, \infty]$.

Assumption 2: There exists $\gamma \in [0, 1]$ such that $\lambda(\gamma, 0)$ and $\lambda(\gamma, 1)$ are both
finite.

Assumption 3: There exists no $\gamma \in [0, 1]$ such that $\lambda(\gamma, 0)$ and $\lambda(\gamma, 1)$ are
both infinite.

The superprediction set for a loss function $\lambda$ is

$$\Sigma_\lambda := \left\{ (x, y) \in [0, \infty)^2 \mid \exists \gamma\ \lambda(\gamma, 0) \le x \text{ and } \lambda(\gamma, 1) \le y \right\}. \tag{1}$$

By Assumption 2, this set is non-empty. For $\eta > 0$, let $E_\eta : [0, \infty]^2 \to [0, 1]^2$
be the homeomorphism defined by $E_\eta(x, y) := (e^{-\eta x}, e^{-\eta y})$. The loss function
$\lambda$ is called $\eta$-mixable if the set $E_\eta(\Sigma_\lambda)$ is convex. It is called mixable if it is
$\eta$-mixable for some $\eta > 0$.

Theorem 1. If a loss function $\lambda$ is $\eta$-mixable, then there exists a strategy for
Learner that guarantees that in the game of prediction with expert advice with
$N$ experts and the loss function $\lambda$ it holds, for all $t$ and for all $n = 1, \ldots, N$,
that

$$L_t \le L_t^n + \frac{\ln N}{\eta}. \tag{2}$$

The bound is optimal: if $\lambda$ is not $\eta$-mixable, then no strategy for Learner can
guarantee (2).

For the proof and other details, see [3], [IHl) [IS]) or [TSl Theorem 8]; one of
the algorithms guaranteeing (2) is the (Strong) Aggregating Algorithm (AA).
As shown in [4], one can take the defensive forecasting algorithm instead of the
AA in the theorem.
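To make bound (2) concrete, the following is a minimal sketch of the AA in the one case where its prediction rule has a simple closed form: for the log loss with $\eta = 1$ the AA predicts the weighted average of the experts' predictions, i.e. the Bayes mixture. The function names, the uniform prior, and the restriction of the experts' predictions to the open interval (0, 1) are assumptions made for this illustration; they are not taken from the paper.

```python
import math


def log_loss(gamma, omega):
    """Log loss of prediction gamma in [0, 1] on binary outcome omega."""
    p = gamma if omega == 1 else 1.0 - gamma
    return math.inf if p <= 0.0 else -math.log(p)


def aggregating_algorithm_log_loss(expert_preds, outcomes):
    """Run the AA for log loss with eta = 1 and a uniform prior.

    expert_preds[t][n] is Expert n's prediction at step t (assumed to lie
    strictly inside (0, 1) so that all losses are finite).
    Returns (L_T, [L_T^1, ..., L_T^N]).
    """
    n_experts = len(expert_preds[0])
    weights = [1.0 / n_experts] * n_experts
    learner_loss = 0.0
    expert_losses = [0.0] * n_experts
    for preds, omega in zip(expert_preds, outcomes):
        # For log loss the AA prediction is the weighted average of the
        # experts' predictions (the Bayes mixture).
        gamma = sum(w * g for w, g in zip(weights, preds))
        learner_loss += log_loss(gamma, omega)
        for n, g in enumerate(preds):
            loss = log_loss(g, omega)
            expert_losses[n] += loss
            weights[n] *= math.exp(-loss)  # exponential weight update
        total = sum(weights)
        weights = [w / total for w in weights]  # renormalize to avoid underflow
    return learner_loss, expert_losses
```

On any input of this form the returned Learner's loss should exceed each expert's loss by at most $\ln N$, which is bound (2) with $\eta = 1$.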

3 Proper scoring rules 

A loss function $\lambda$ is a proper scoring rule if for any $\pi, \pi' \in [0, 1]$ it holds that

$$\pi \lambda(\pi, 1) + (1 - \pi) \lambda(\pi, 0) \le \pi \lambda(\pi', 1) + (1 - \pi) \lambda(\pi', 0);$$

it is a strictly proper scoring rule if the inequality holds with $<$ in place of $\le$
whenever $\pi' \ne \pi$. The interpretation is that the prediction $\pi$ is an estimate of the
probability that $\omega = 1$. The definition says that the expected loss with respect
to a probability distribution is minimal if the prediction is the true probability
of 1. Informally, a strictly proper scoring rule encourages a forecaster (Learner
or one of the experts) to announce his true subjective probability that the next
outcome is 1. (See [B], [5], and [5] for detailed reviews.)

Simple examples of strictly proper scoring rules are provided by the two most
common loss functions: the log loss function

$$\lambda(\gamma, \omega) := -\ln(\omega \gamma + (1 - \omega)(1 - \gamma))$$

(i.e., $\lambda(\gamma, 0) = -\ln(1 - \gamma)$ and $\lambda(\gamma, 1) = -\ln \gamma$) and the square loss function

$$\lambda(\gamma, \omega) := (\omega - \gamma)^2.$$

A trivial generalization of the log loss function that is important for us is

$$\lambda(\gamma, \omega) := -\frac{1}{\eta} \ln(\omega \gamma + (1 - \omega)(1 - \gamma)), \tag{3}$$

where $\eta$ is a positive constant. The generalized log loss function is also a proper
scoring rule (in general, multiplying a proper scoring rule by a positive constant
we again obtain a proper scoring rule).
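The defining inequality of a proper scoring rule can be checked numerically on a grid. The sketch below does this for the log loss and the square loss; the absolute loss $|\omega - \gamma|$ is not discussed in the text and is included only as a contrasting example of a loss function that is not proper. The grid, the tolerance, and all function names are assumptions of this illustration.

```python
import math


def log_loss(gamma, omega):
    p = gamma if omega == 1 else 1.0 - gamma
    return math.inf if p <= 0.0 else -math.log(p)


def square_loss(gamma, omega):
    return (omega - gamma) ** 2


def absolute_loss(gamma, omega):
    """Not from the text: included only as a contrast that is not proper."""
    return abs(omega - gamma)


def expected_loss(loss, pi, gamma):
    """Expected loss of prediction gamma when outcome 1 has probability pi."""
    return pi * loss(gamma, 1) + (1.0 - pi) * loss(gamma, 0)


def is_proper_on_grid(loss, grid, tol=1e-12):
    """Check that predicting pi itself minimizes the expected loss over the grid."""
    return all(
        expected_loss(loss, pi, pi) <= expected_loss(loss, pi, other) + tol
        for pi in grid
        for other in grid
    )


# Interior grid points only, so that the log loss stays finite.
GRID = [k / 100 for k in range(1, 100)]
```

A grid check of this kind can refute properness but cannot prove it; for the log and square losses it merely agrees with the statement in the text.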

We will often say "(strictly) proper loss function" meaning a loss function 
that is a (strictly) proper scoring rule. Our main interest will be in loss functions 
that are both mixable and proper. Let $\mathcal{L}$ be the set of all such loss functions.

4 Prediction with expert evaluators' advice 

In this section we consider a very general protocol of prediction with expert 
advice. The intuition behind special cases of this protocol will be discussed in 
the following sections. 

Prediction with expert evaluators' advice

  FOR $t = 1, 2, \ldots$:
    Expert $n$ announces $\gamma_t^n \in [0, 1]$, $\eta_t^n > 0$, and $\eta_t^n$-mixable $\lambda_t^n \in \mathcal{L}$,
      $n = 1, \ldots, N$.
    Learner announces $\gamma_t \in [0, 1]$.
    Reality announces $\omega_t \in \{0, 1\}$.
  END FOR

The main mathematical result of this paper is the following. 

Theorem 2. Learner has a strategy (e.g., the defensive forecasting algorithm
described below) that guarantees that in the game of prediction with $N$ expert
evaluators' advice it holds, for all $T$ and for all $n = 1, \ldots, N$, that

$$\sum_{t=1}^T \eta_t^n \left( \lambda_t^n(\gamma_t, \omega_t) - \lambda_t^n(\gamma_t^n, \omega_t) \right) \le \ln N.$$

The description of the defensive forecasting algorithm and the proof of the
theorem will be given in Section 7.

Corollary 1. For any $\eta > 0$, Learner has a strategy that guarantees

$$\sum_{t=1}^T \lambda_t^n(\gamma_t, \omega_t) \le \sum_{t=1}^T \lambda_t^n(\gamma_t^n, \omega_t) + \frac{\ln N}{\eta}, \tag{4}$$

for all $T$ and all $n = 1, \ldots, N$, in the game of prediction with $N$ expert
evaluators' advice in which the experts are required to always choose $\eta$-mixable loss
functions $\lambda_t^n$.

This corollary is more intuitive than Theorem 2, as (4) compares the cumulative
losses suffered by Learner and each expert.
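The step from Theorem 2 to Corollary 1 is short enough to spell out. The sketch below assumes that every expert announces the same learning rate $\eta_t^n = \eta$ at every step and that all the sums involved are finite; the bound of Theorem 2 is written in the form in which it is derived in Section 7.

```latex
% Theorem 2 with \eta_t^n = \eta for all t and n:
\sum_{t=1}^{T} \eta \bigl( \lambda_t^n(\gamma_t,\omega_t) - \lambda_t^n(\gamma_t^n,\omega_t) \bigr) \le \ln N .
% Dividing by \eta > 0 and moving the experts' losses to the right-hand side:
\sum_{t=1}^{T} \lambda_t^n(\gamma_t,\omega_t)
  \le \sum_{t=1}^{T} \lambda_t^n(\gamma_t^n,\omega_t) + \frac{\ln N}{\eta} .
```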

In the following sections we will discuss two interesting special cases of
Theorem 2 and Corollary 1.

5 Prediction with constant expert evaluators' 
advice 

In the game of this section, as in the previous one, the experts are "expert 
evaluators" : each of them measures Learner's and his own performance using 
his own loss function, supposed to be mixable and proper. The difference is that 
now each expert is linked to a fixed loss function. The game is specified by $N$
loss functions $\lambda^1, \ldots, \lambda^N$.

Prediction with constant expert evaluators' advice

  $L_0^{(n)} := 0$, $n = 1, \ldots, N$.
  $L_0^n := 0$, $n = 1, \ldots, N$.
  FOR $t = 1, 2, \ldots$:
    Expert $n$ announces $\gamma_t^n \in [0, 1]$, $n = 1, \ldots, N$.
    Learner announces $\gamma_t \in [0, 1]$.
    Reality announces $\omega_t \in \{0, 1\}$.
    $L_t^{(n)} := L_{t-1}^{(n)} + \lambda^n(\gamma_t, \omega_t)$, $n = 1, \ldots, N$.
    $L_t^n := L_{t-1}^n + \lambda^n(\gamma_t^n, \omega_t)$, $n = 1, \ldots, N$.
  END FOR

There are two changes in the protocol as compared to the basic protocol of
prediction with expert advice in Section 2. The accumulated loss $L_t^n$ of each
expert is now calculated according to his own loss function $\lambda^n$. For Learner,
there is no single accumulated loss anymore. Instead, the loss $L_t^{(n)}$ of Learner
is calculated separately against each expert, according to that expert's loss
function $\lambda^n$. Informally speaking, each expert evaluates his own performance
and the performance of Learner according to the expert's own (but publicly
known) criteria.

In the standard setting of prediction with expert advice it is often said that
Learner's goal is to compete with the best expert in the pool. In the new setting,
we cannot speak about the best expert: the experts' performance is evaluated
by different loss functions and thus the losses may be measured on different
scales. But it still makes sense to consider bounds on the regret $L_t^{(n)} - L_t^n$ for
each $n$.

Theorem 2 (or Corollary 1) immediately implies the following performance
guarantee for the defensive forecasting algorithm in our current setting.

Corollary 2. Suppose that every $\lambda^n$ is a proper loss function that is $\eta^n$-mixable
for some $\eta^n > 0$, $n = 1, \ldots, N$. Then Learner has a strategy (such as the
defensive forecasting algorithm) that guarantees that in the game of prediction
with $N$ experts' advice and loss functions $\lambda^1, \ldots, \lambda^N$ it holds, for all $T$ and for
all $n = 1, \ldots, N$, that

$$L_T^{(n)} \le L_T^n + \frac{\ln N}{\eta^n}. \tag{5}$$

The new bound (5) is precisely the same as the bound for the standard setting
of Theorem 1. But rigorous comparison of the actual power of these two bounds
is not so trivial.

Formally speaking, the task of Learner in the new protocol is not strictly 
harder and is not strictly easier than in the standard protocol: the task is 
incomparable. Learner must now compete with different experts by different 
rules. But this is not necessarily a disadvantage. Consider an example. Suppose 
that all experts except one are linked to one loss function and the last expert is 
linked to another loss function. And this last loss function is somehow trivial, 
say, equals 1 independent of the outcome and the prediction. Then we arrive
at the standard protocol with $N - 1$ experts, since the regret against the last
expert is zero independent of our predictions. In this example, we can get a
better bound than that given by Corollary 2. This non-optimality is especially
apparent in the case when we have a huge number of experts, but all except 
one are linked to a trivial loss function. Then our regret bound is large, being 
a logarithm of a huge number, whereas one can achieve zero regret against all 
experts whatever strategy they use — since the loss functions are unfavourable 
to the experts. 

Nevertheless, it is intuitively clear that the new protocol is somewhat harder
for Learner in general. And Corollary 2 is really surprising: it is hard to believe
that Learner can compete against several arbitrary loss functions as well as 
against only one of them. The reason why this is possible is that the loss 
functions are assumed to be proper. 



Multiobjective prediction with expert advice 

To conclude this section, let us consider another variant of the protocol with 
several loss functions. As mentioned in the introduction, sometimes we have 
experts' predictions, and we are not given a single loss function, but have several 
possible candidates. The most cautious way to generate Learner's predictions 
is to ensure that the regret is small against all experts and according to all loss 
functions. The following protocol formalizes this task. Now we have $N$ experts
and $M$ loss functions $\lambda^1, \ldots, \lambda^M$.

Multiobjective prediction with expert advice

  $L_0^{(m)} := 0$, $m = 1, \ldots, M$.
  $L_0^{n,m} := 0$, $n = 1, \ldots, N$ and $m = 1, \ldots, M$.
  FOR $t = 1, 2, \ldots$:
    Expert $n$ announces $\gamma_t^n \in [0, 1]$, $n = 1, \ldots, N$.
    Learner announces $\gamma_t \in [0, 1]$.
    Reality announces $\omega_t \in \{0, 1\}$.
    $L_t^{(m)} := L_{t-1}^{(m)} + \lambda^m(\gamma_t, \omega_t)$, $m = 1, \ldots, M$.
    $L_t^{n,m} := L_{t-1}^{n,m} + \lambda^m(\gamma_t^n, \omega_t)$, $n = 1, \ldots, N$ and $m = 1, \ldots, M$.
  END FOR



Corollary 3. Suppose that every $\lambda^m$ is an $\eta^m$-mixable proper loss function,
for some $\eta^m > 0$, $m = 1, \ldots, M$. The defensive forecasting algorithm
guarantees that, in the multiobjective game of prediction with $N$ experts and the loss
functions $\lambda^1, \ldots, \lambda^M$,

$$L_T^{(m)} \le L_T^{n,m} + \frac{\ln MN}{\eta^m} \tag{6}$$

for all $T$, all $n = 1, \ldots, N$, and all $m = 1, \ldots, M$.

Proof. This follows easily from Corollary 2. For each $n \in \{1, \ldots, N\}$, let us
construct $M$ new experts $(n, m)$. Expert $(n, m)$ predicts as Expert $n$ and is
linked to the loss function $\lambda^m$. Applying Corollary 2 to these $MN$ experts, we
get the bound (6). □

The last protocol is harder for Learner than the standard protocol when
$M > 1$: Learner must satisfy all old regret bounds and also some new bounds.
But the increase in the regret bounds is surprisingly small: only an additive
term proportional to $\ln M$. Whether the dependence on $M$ in Corollary 3 is
optimal remains an open problem.

A further generalization of our last protocol involves a binary relation $R$
between the $N$ experts and the $M$ loss functions, where $nRm$, $n \in \{1, \ldots, N\}$
and $m \in \{1, \ldots, M\}$, is interpreted as Expert $n$ using the loss function $\lambda^m$
when evaluating Learner's and his own performance. It is assumed that for
each $n$ there exists at least one $m$ such that $nRm$. The relation $R$ is naturally
represented as a bipartite graph connecting the vertices in the set $\{1, \ldots, N\}$ to
vertices in the set $\{1, \ldots, M\}$. The bound (6) now becomes

$$L_T^{(m)} \le L_T^{n,m} + \frac{\ln K}{\eta^m}$$

for all $(n, m) \in R$, where $K$ is the cardinality of $R$ (equivalently, the number of
edges in the bipartite graph).

A simple example

Let $\lambda^1$ be the log loss function and $\lambda^2$ the square loss function. As already
mentioned, both loss functions are proper and mixable. It is known (see, e.g.,
[3], [TD], or [13]) that $\lambda^1$ is 1-mixable and $\lambda^2$ is 2-mixable. Suppose we are
competing with $N$ experts producing predictions $\gamma_t^n$ under these two loss functions.
The defensive forecasting algorithm ensures that the regret with respect to the
logarithmic loss function is bounded by $\ln(2N) \le \ln N + 0.7$, and the regret with
respect to the square loss function is bounded by $0.5 \ln(2N) \le 0.5 \ln N + 0.4$,
which is practically the same as the regrets against $N$ experts that are achievable when
Learner chooses his predictions with respect to one of the loss functions only.
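The two constants in this example come from bound (6) with $M = 2$; the arithmetic, with $\ln 2$ rounded to four decimal places, is as follows.

```latex
% Log loss, \eta^1 = 1:
\frac{\ln(2N)}{1} = \ln N + \ln 2 \approx \ln N + 0.6931 \le \ln N + 0.7 .
% Square loss, \eta^2 = 2:
\frac{\ln(2N)}{2} = 0.5 \ln N + 0.5 \ln 2 \approx 0.5 \ln N + 0.3466 \le 0.5 \ln N + 0.4 .
```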

6 Prediction with specialist experts' advice 

The experts of this section are allowed to "sleep", i.e., abstain from giving advice
to Learner at some steps. This generalization is important for text-processing
applications (see, e.g., [5]). We will be assuming that there is only one loss
function $\lambda$, although generalization to the case of $N$ loss functions $\lambda^1, \ldots, \lambda^N$ is
straightforward. The loss function $\lambda$ does not need to be proper (but it is still
required to be mixable).

Let $a$ be any object that does not belong to $[0, 1]$; intuitively, it will stand
for an expert's decision to abstain.

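Prediction with specialist experts' advice

  $L_0^{(n)} := 0$, $n = 1, \ldots, N$.
  $L_0^n := 0$, $n = 1, \ldots, N$.
  FOR $t = 1, 2, \ldots$:
    Expert $n$ announces $\gamma_t^n \in ([0, 1] \cup \{a\})$, $n = 1, \ldots, N$.
    Learner announces $\gamma_t \in [0, 1]$.
    Reality announces $\omega_t \in \{0, 1\}$.
    $L_t^{(n)} := L_{t-1}^{(n)} + \mathbb{I}_{\{\gamma_t^n \ne a\}} \lambda(\gamma_t, \omega_t)$, $n = 1, \ldots, N$.
    $L_t^n := L_{t-1}^n + \mathbb{I}_{\{\gamma_t^n \ne a\}} \lambda(\gamma_t^n, \omega_t)$, $n = 1, \ldots, N$.
  END FOR

The indicator function $\mathbb{I}_{\{\gamma_t^n \ne a\}}$ of the event $\gamma_t^n \ne a$ is defined to be 1 if $\gamma_t^n \ne a$
and 0 if $\gamma_t^n = a$. Therefore, $L_t^{(n)}$ and $L_t^n$ refer to the cumulative loss of Learner
and Expert $n$ over the steps when Expert $n$ is awake. Now Learner's goal is to
do as well as each expert on the steps chosen by that expert.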

Corollary 4. Let $\lambda$ be a loss function that is $\eta$-mixable for some $\eta > 0$. Then
Learner has a strategy (e.g., the defensive forecasting algorithm) that guarantees
that in the game of prediction with $N$ specialist experts' advice and loss function
$\lambda$ it holds, for all $T$ and for all $n = 1, \ldots, N$, that

$$L_T^{(n)} \le L_T^n + \frac{\ln N}{\eta}. \tag{7}$$


Proof. Without loss of generality the loss function $\lambda$ may be assumed to be
proper (this can be achieved by reparameterization of the predictions $\gamma \in [0, 1]$).
The protocol of this section then becomes a special case of the protocol of Section
4 in which at each step each expert outputs $\eta_t^n = \eta$ and either $\lambda_t^n = \lambda$ (when he
is awake) or $\lambda_t^n = 0$ (when he is asleep). (Alternatively, in which at each step
each expert outputs $\lambda_t^n = \lambda$ and either $\eta_t^n = \eta$, when he is awake, or $\eta_t^n = 0$,
when he is asleep.)

□

7 Defensive forecasting algorithm and the proof
of Theorem 2

In this section we prove Theorem 2. Our proof is constructive: we explicitly
describe the defensive forecasting algorithm achieving the bound in Theorem 2.

The algorithm 

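For each $n = 1, \ldots, N$, let us define the function

$$Q^n : \left( [0, 1]^N \times (0, \infty)^N \times \mathcal{L}^N \times [0, 1] \times \{0, 1\} \right)^* \to [0, \infty]$$

$$Q^n(\gamma_1, \eta_1, \lambda_1, \pi_1, \omega_1, \ldots, \gamma_T, \eta_T, \lambda_T, \pi_T, \omega_T) := \prod_{t=1}^T e^{\eta_t^n \left( \lambda_t^n(\pi_t, \omega_t) - \lambda_t^n(\gamma_t^n, \omega_t) \right)}, \tag{8}$$

where $\gamma_t^n$ are the components of $\gamma_t$, $\eta_t^n$ are the components of $\eta_t$, and $\lambda_t^n$ are
the components of $\lambda_t$:

$$\gamma_t := (\gamma_t^1, \ldots, \gamma_t^N), \quad \eta_t := (\eta_t^1, \ldots, \eta_t^N), \quad \lambda_t := (\lambda_t^1, \ldots, \lambda_t^N).$$

As usual, the product $\prod_{t=1}^0$ is interpreted as 1, so that $Q^n() = 1$. The functions
$Q^n$ will usually be applied to $\gamma_t := (\gamma_t^1, \ldots, \gamma_t^N)$, the predictions made by all the
$N$ experts at step $t$, $\eta_t := (\eta_t^1, \ldots, \eta_t^N)$, the learning rates chosen by the experts
at step $t$, and $\lambda_t := (\lambda_t^1, \ldots, \lambda_t^N)$, the loss functions used by the experts at step
$t$. Notice that $Q^n$ does not depend on the predictions, learning rates, and loss
functions of the experts other than Expert $n$.

Set

$$Q := \frac{1}{N} \sum_{n=1}^N Q^n$$

and

$$f_t(\pi, \omega) := Q(\gamma_1, \eta_1, \lambda_1, \pi_1, \omega_1, \ldots, \gamma_{t-1}, \eta_{t-1}, \lambda_{t-1}, \pi_{t-1}, \omega_{t-1}, \gamma_t, \eta_t, \lambda_t, \pi, \omega)$$
$$\qquad - Q(\gamma_1, \eta_1, \lambda_1, \pi_1, \omega_1, \ldots, \gamma_{t-1}, \eta_{t-1}, \lambda_{t-1}, \pi_{t-1}, \omega_{t-1}), \tag{9}$$

where $(\pi, \omega)$ ranges over $[0, 1] \times \{0, 1\}$; the expression $\infty - \infty$ is understood as,
say, 0. The defensive forecasting algorithm is defined in terms of the functions
$f_t$.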

Defensive forecasting algorithm

  FOR $t = 1, 2, \ldots$:
    Read the experts' predictions $\gamma_t = (\gamma_t^1, \ldots, \gamma_t^N) \in [0, 1]^N$,
      learning rates $\eta_t = (\eta_t^1, \ldots, \eta_t^N) \in (0, \infty)^N$,
      and loss functions $\lambda_t = (\lambda_t^1, \ldots, \lambda_t^N) \in \mathcal{L}^N$.
    Define $f_t : [0, 1] \times \{0, 1\} \to [-\infty, \infty]$ by (9).
    If $f_t(0, 1) \le 0$, predict $\pi_t := 0$ and go to R.
    If $f_t(1, 0) \le 0$, predict $\pi_t := 1$ and go to R.
    Otherwise (if both $f_t(0, 1) > 0$ and $f_t(1, 0) > 0$),
      take any $\pi$ satisfying $f_t(\pi, 0) = f_t(\pi, 1)$ and predict $\pi_t := \pi$.
  R: Read Reality's move $\omega_t \in \{0, 1\}$.
  END FOR

The existence of a $\pi$ satisfying $f_t(\pi, 0) = f_t(\pi, 1)$ will be proved in Lemma 1
below. We will see that the function $f_t(\pi) := f_t(\pi, 1) - f_t(\pi, 0)$ takes values
of opposite signs at $\pi = 0$ and $\pi = 1$. Therefore, a root of $f_t(\pi) = 0$ can be
found by, e.g., bisection (see [12], Chapter 9, for a review of bisection and more
efficient methods, such as Brent's).
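The algorithm can be sketched in a few lines of code. The version below keeps the current values $Q^n_{t-1}$ as weights, evaluates $f_t$ as the average of $Q^n_{t-1}$ times (exponential factor minus one), and finds the root by plain bisection. The class and method names, the number of bisection iterations, and the requirement that the experts' predictions lie strictly inside (0, 1) (so that the convention for $\infty - \infty$ is never needed) are assumptions of this illustration rather than part of the paper.

```python
import math


def log_loss(gamma, omega):
    p = gamma if omega == 1 else 1.0 - gamma
    return math.inf if p <= 0.0 else -math.log(p)


def square_loss(gamma, omega):
    return (omega - gamma) ** 2


class DefensiveForecaster:
    """Tracks Q^n for each expert; advice is a list of (gamma, eta, loss) triples."""

    def __init__(self, n_experts):
        self.q = [1.0] * n_experts  # Q^n_0 = 1 (empty product)

    @staticmethod
    def _factor(pi, omega, gamma, eta, loss):
        # One factor of the product defining Q^n.
        return math.exp(eta * (loss(pi, omega) - loss(gamma, omega)))

    def f(self, pi, omega, advice):
        """f_t(pi, omega): change in Q if Learner predicts pi and omega occurs."""
        total = 0.0
        for q_n, (gamma, eta, loss) in zip(self.q, advice):
            if q_n > 0.0:  # skip weights that have underflowed to zero
                total += q_n * (self._factor(pi, omega, gamma, eta, loss) - 1.0)
        return total / len(self.q)

    def predict(self, advice, iterations=50):
        if self.f(0.0, 1, advice) <= 0.0:
            return 0.0
        if self.f(1.0, 0, advice) <= 0.0:
            return 1.0
        # f(., 1) - f(., 0) is positive at 0 and negative at 1: bisect.
        lo, hi = 0.0, 1.0
        for _ in range(iterations):
            mid = 0.5 * (lo + hi)
            if self.f(mid, 1, advice) - self.f(mid, 0, advice) > 0.0:
                lo = mid
            else:
                hi = mid
        return 0.5 * (lo + hi)

    def update(self, pi, omega, advice):
        """Multiply each Q^n by its new factor once the outcome is known."""
        for n, (gamma, eta, loss) in enumerate(advice):
            self.q[n] *= self._factor(pi, omega, gamma, eta, loss)
```

A typical step calls `predict(advice)`, observes the outcome, and then calls `update(pi, omega, advice)`. Because the root is found only approximately, the average of the $Q^n$ can exceed 1 by a small numerical error, so any check of the regret bound should allow a tolerance.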

Reductions 

The most important property of the defensive forecasting algorithm is that it
produces predictions $\pi_t$ such that the sequence

$$Q_t := Q(\gamma_1, \eta_1, \lambda_1, \pi_1, \omega_1, \ldots, \gamma_t, \eta_t, \lambda_t, \pi_t, \omega_t) \tag{10}$$

is non-increasing. This property will be proved later; for now, we will only
check that it implies the bound on the regret term given in Theorem 2. Since
the initial value $Q_0$ of $Q$ is 1, we have $Q_t \le 1$ for all $t$. And since $Q^n \ge 0$ for all
$n$, we have $Q^n \le NQ$ for all $n$. Therefore, $Q_t^n$, defined by (10) with $Q^n$ in place
of $Q$, is at most $N$ at each step $t$. By the definition of $Q^n$ this means that

$$\sum_{t=1}^T \eta_t^n \left( \lambda_t^n(\pi_t, \omega_t) - \lambda_t^n(\gamma_t^n, \omega_t) \right) \le \ln N,$$

which is the bound claimed in the theorem.

In the proof of the inequalities $Q_0 \ge Q_1 \ge \cdots$ we will follow [4] (for a
presentation adapted to the binary case, see [17]). The key fact we use is that
$Q$ is a game-theoretic supermartingale. Let us define this notion and prove its
basic properties.

Let $E$ be any non-empty set. A function $S : (E \times [0, 1] \times \{0, 1\})^* \to (-\infty, \infty]$
is called a supermartingale (omitting "game-theoretic") if, for any $T$, any
$e_1, \ldots, e_T \in E$, any $\pi_1, \ldots, \pi_T \in [0, 1]$, and any $\omega_1, \ldots, \omega_{T-1} \in \{0, 1\}$, it holds
that

$$\pi_T S(e_1, \pi_1, \omega_1, \ldots, e_{T-1}, \pi_{T-1}, \omega_{T-1}, e_T, \pi_T, 1)$$
$$\qquad + (1 - \pi_T) S(e_1, \pi_1, \omega_1, \ldots, e_{T-1}, \pi_{T-1}, \omega_{T-1}, e_T, \pi_T, 0)$$
$$\qquad \le S(e_1, \pi_1, \omega_1, \ldots, e_{T-1}, \pi_{T-1}, \omega_{T-1}). \tag{11}$$

Remark. The standard measure-theoretic notion of a supermartingale is
obtained when the arguments $\pi_1, \pi_2, \ldots$ in (11) are replaced by the forecasts
produced by a fixed forecasting system. See, e.g., [13] for details. Game-theoretic
supermartingales are referred to as "superfarthingales" in [7].

A supermartingale $S$ is called forecast-continuous if, for all $T \in \{1, 2, \ldots\}$,
all $e_1, \ldots, e_T \in E$, all $\pi_1, \ldots, \pi_{T-1} \in [0, 1]$, and all $\omega_1, \ldots, \omega_T \in \{0, 1\}$,

$$S(e_1, \pi_1, \omega_1, \ldots, e_{T-1}, \pi_{T-1}, \omega_{T-1}, e_T, \pi, \omega_T)$$

is a continuous function of $\pi \in [0, 1]$. The following lemma states the property
of forecast-continuous supermartingales that is most important for us.

Lemma 1. Let $S$ be a forecast-continuous supermartingale. For any $T$ and
for any values of the arguments $e_1, \ldots, e_T \in E$, $\pi_1, \ldots, \pi_{T-1} \in [0, 1]$, and
$\omega_1, \ldots, \omega_{T-1} \in \{0, 1\}$, there exists $\pi \in [0, 1]$ such that, for both $\omega = 0$ and
$\omega = 1$,

$$S(e_1, \pi_1, \omega_1, \ldots, e_{T-1}, \pi_{T-1}, \omega_{T-1}, e_T, \pi, \omega) \le S(e_1, \pi_1, \omega_1, \ldots, e_{T-1}, \pi_{T-1}, \omega_{T-1}).$$

Proof. Define a function $f : [0, 1] \times \{0, 1\} \to (-\infty, \infty]$ by

$$f(\pi, \omega) := S(e_1, \pi_1, \omega_1, \ldots, e_{T-1}, \pi_{T-1}, \omega_{T-1}, e_T, \pi, \omega) - S(e_1, \pi_1, \omega_1, \ldots, e_{T-1}, \pi_{T-1}, \omega_{T-1})$$

(the subtrahend is assumed finite: there is nothing to prove when it is infinite).
Since $S$ is a forecast-continuous supermartingale, $f(\pi, \omega)$ is continuous in $\pi$ and

$$\pi f(\pi, 1) + (1 - \pi) f(\pi, 0) \le 0 \tag{12}$$

for all $\pi \in [0, 1]$. In particular, $f(0, 0) \le 0$ and $f(1, 1) \le 0$.

Our goal is to show that for some $\pi \in [0, 1]$ we have $f(\pi, 1) \le 0$ and $f(\pi, 0) \le 0$.
If $f(0, 1) \le 0$, we can take $\pi = 0$. If $f(1, 0) \le 0$, we can take $\pi = 1$. Assume
that $f(0, 1) > 0$ and $f(1, 0) > 0$. Then the difference

$$f(\pi) := f(\pi, 1) - f(\pi, 0)$$

is positive for $\pi = 0$ and negative for $\pi = 1$. By the intermediate value theorem,
$f(\pi) = 0$ for some $\pi \in (0, 1)$. By (12) we have $f(\pi, 1) = f(\pi, 0) \le 0$. □

The fact that the sequence (10) is non-increasing follows from the fact (see
below) that $Q$ is a supermartingale (when restricted to the allowed moves for
the players). The proof of Lemma 1, as applied to the supermartingale $Q$, is
summarized in (9), the pseudocode for the defensive forecasting algorithm, and
the paragraph following it.

The weighted sum of finitely many forecast-continuous supermartingales
taken with positive weights is again a forecast-continuous supermartingale.
Therefore, the proof will be complete if we check that $Q^n$ is a forecast-continuous
supermartingale under the restriction that $\lambda_t^n$ is $\eta_t^n$-mixable for all $n$ and $t$. But
before we can do this, we will need to do some preparatory work in the next
subsection.

Geometry of mixability and proper loss functions 

Assumption 1 and the compactness of $[0, 1]$ imply that the superprediction set
(1) is closed. Along with the superprediction set, we will also consider the
prediction set

$$\Pi_\lambda := \left\{ (x, y) \in [0, \infty]^2 \mid \exists \gamma\ \lambda(\gamma, 0) = x \text{ and } \lambda(\gamma, 1) = y \right\}.$$

In many cases, the prediction set is the boundary of the superprediction set.
The prediction set can also be defined as the set of points

$$\Lambda_\gamma := (\lambda(\gamma, 0), \lambda(\gamma, 1)) \tag{13}$$

where $\gamma$ ranges over the prediction space $[0, 1]$. It is clear that the prediction
set is compact.

Let us fix a constant $\eta > 0$. The prediction set of the generalized log loss
game (3) is the curve $\{(x, y) \mid e^{-\eta x} + e^{-\eta y} = 1\}$ in $\mathbb{R}^2$. For each $\pi \in (0, 1)$, the
$\pi$-point of this curve is $\Lambda_\pi$, i.e., the point

$$\left( -\frac{1}{\eta} \ln(1 - \pi), -\frac{1}{\eta} \ln \pi \right).$$

Since the generalized log loss function is proper, the minimum of $(1 - \pi)x + \pi y$
on the curve $e^{-\eta x} + e^{-\eta y} = 1$ is attained at the $\pi$-point; in other words, the
tangent of $e^{-\eta x} + e^{-\eta y} = 1$ at the $\pi$-point is orthogonal to the vector $(1 - \pi, \pi)$.

A shift of the curve $e^{-\eta x} + e^{-\eta y} = 1$ is the curve $e^{-\eta(x - \alpha)} + e^{-\eta(y - \beta)} = 1$
for some $\alpha, \beta \in \mathbb{R}$ (i.e., it is a parallel translation of $e^{-\eta x} + e^{-\eta y} = 1$ by some
vector $(\alpha, \beta)$). The $\pi$-point of this shift is the point $(\alpha, \beta) + \Lambda_\pi$, where $\Lambda_\pi$ is
the $\pi$-point of the original curve $e^{-\eta x} + e^{-\eta y} = 1$. This provides us with a
coordinate system on each shift of $e^{-\eta x} + e^{-\eta y} = 1$ ($\pi \in (0, 1)$ serves as the
coordinate of the corresponding $\pi$-point).

It will be convenient to use the geographical expressions "Northeast" and
"Southwest". A point $(x_1, y_1)$ is Northeast of a point $(x_2, y_2)$ if $x_1 \ge x_2$ and
$y_1 \ge y_2$. A set $A \subseteq \mathbb{R}^2$ is Northeast of a shift of $e^{-\eta x} + e^{-\eta y} = 1$ if each point
of $A$ is Northeast of some point of the shift. Similarly, a point is Northeast
of a shift of $e^{-\eta x} + e^{-\eta y} = 1$ (or of a straight line with a negative slope) if
it is Northeast of some point on that shift (or line). "Northeast" is replaced
by "Southwest" when the inequalities are $\le$ rather than $\ge$, and we add the
attribute "strictly" when the inequalities are strict.

It is easy to see that the loss function is $\eta$-mixable if and only if for each
point $(a, b)$ on the boundary of the superprediction set there exists a shift of
$e^{-\eta x} + e^{-\eta y} = 1$ passing through $(a, b)$ such that the superprediction set lies
to the Northeast of the shift. This follows from the fact that the shifts of
$e^{-\eta x} + e^{-\eta y} = 1$ correspond to the straight lines with negative slope under the
homeomorphism $E_\eta$: indeed, the preimage of $ax + by = c$, where $a > 0$, $b > 0$,
and $c > 0$, is $a e^{-\eta x} + b e^{-\eta y} = c$, which is the shift of $e^{-\eta x} + e^{-\eta y} = 1$ by the
vector

$$\left( \frac{1}{\eta} \ln \frac{a}{c}, \frac{1}{\eta} \ln \frac{b}{c} \right).$$
A similar statement for the property of being proper is: 

Lemma 2. Suppose the loss function $\lambda$ is $\eta$-mixable. It is a proper loss function
if and only if for each $\pi$ the superprediction set is to the Northeast of the shift
of $e^{-\eta x} + e^{-\eta y} = 1$ passing through $\Lambda_\pi$ (as defined by (13)) and having $\Lambda_\pi$ as
its $\pi$-point.

Proof. The part "if" is obvious, so we will only prove the part "only if". Let $\lambda$ be
$\eta$-mixable and proper. Suppose there exists $\pi$ such that the shift $A_1$ of $e^{-\eta x} + e^{-\eta y} = 1$
passing through $\Lambda_\pi$ and having $\Lambda_\pi$ as its $\pi$-point has some superpredictions
strictly to its Southwest. Let $s$ be such a superprediction, let $A_2$ be the shift of
$e^{-\eta x} + e^{-\eta y} = 1$ passing through $\Lambda_\pi$ and $s$, and let $A_3$ be the tangent to $A_1$ at
the point $\Lambda_\pi$. Then there are points on $A_2$ between $\Lambda_\pi$ and $s$ that lie strictly to
the Southwest of $A_3$ (take any point on $A_2$ between $\Lambda_\pi$ and $s$ that is sufficiently
close to $\Lambda_\pi$). By the $\eta$-mixability of $\lambda$ these points must be superpredictions,
which contradicts $\lambda$ being a proper loss function (since $A_3$ is the straight line
passing through $\Lambda_\pi$ and orthogonal to $(1 - \pi, \pi)$). □

Notice that we never assume our loss functions to be strictly proper. (Ge- 
ometrically, the difference between proper mixable loss functions and strictly 
proper mixable loss functions is that the former's prediction set is allowed to 
have corners.) 

Proof of the supermartingale property 



Let $E \subseteq [0, 1]^N \times (0, \infty)^N \times \mathcal{L}^N$ consist of sequences

$$(\gamma^1, \ldots, \gamma^N, \eta^1, \ldots, \eta^N, \lambda^1, \ldots, \lambda^N)$$

such that $\lambda^n$ is $\eta^n$-mixable for all $n = 1, \ldots, N$. We will only be interested
in the restriction of $Q^n$ and $Q$ on $(E \times [0, 1] \times \{0, 1\})^*$; these restrictions are
denoted with the same symbols.

The following lemma completes the proof of Theorem 2. We will prove
it without calculations, unlike the proofs (of different but somewhat similar
properties) presented in [4] (and, specifically for the binary case, in [17]).

Lemma 3. The function $Q^n$ defined on $(E \times [0, 1] \times \{0, 1\})^*$ by (8) is a
supermartingale.

Proof. It suffices to check that it is always true that

$$\pi_T \exp\left( \eta_T^n \left( \lambda_T^n(\pi_T, 1) - \lambda_T^n(\gamma_T^n, 1) \right) \right) + (1 - \pi_T) \exp\left( \eta_T^n \left( \lambda_T^n(\pi_T, 0) - \lambda_T^n(\gamma_T^n, 0) \right) \right) \le 1.$$

To simplify the notation, we omit the indices $n$ and $T$; this does not lead
to any ambiguity. Using the notation $(a, b) := \Lambda_\pi = (\lambda(\pi, 0), \lambda(\pi, 1))$ and
$(x, y) := \Lambda_\gamma = (\lambda(\gamma, 0), \lambda(\gamma, 1))$, we can further simplify the last inequality to

$$(1 - \pi) \exp(\eta(a - x)) + \pi \exp(\eta(b - y)) \le 1.$$

In other words, it suffices to check that the (super)prediction set lies to the
Northeast of the shift

$$\exp\left( -\eta \left( x - a - \frac{1}{\eta} \ln(1 - \pi) \right) \right) + \exp\left( -\eta \left( y - b - \frac{1}{\eta} \ln \pi \right) \right) = 1 \tag{14}$$

of the curve $e^{-\eta x} + e^{-\eta y} = 1$. The vector by which (14) is shifted is

$$\left( a + \frac{1}{\eta} \ln(1 - \pi), b + \frac{1}{\eta} \ln \pi \right)$$

and so $(a, b)$ is the $\pi$-point of that shift. This completes the proof of the lemma:
by Lemma 2, the superprediction set indeed lies to the Northeast of that shift.

□


A simple special case

In the case where $\lambda_t^n = \lambda$ is the log loss function and $\eta_t^n = 1$ for all $n$ and $t$, the
supermartingale (8) (which is in fact a martingale now) becomes a likelihood
ratio process: namely, it becomes the ratio

$$\frac{\prod_{t=1}^T \gamma_t^n(\{\omega_t\})}{\prod_{t=1}^T \pi_t(\{\omega_t\})},$$

where $p$, $p \in [0, 1]$, stands for the probability measure on $\{0, 1\}$ such that
$p(\{1\}) = p$. The mixed martingale $Q$ becomes the likelihood ratio with the
Bayes mixture as the numerator, and it is easy to see that in this case defensive
forecasting reduces to the Bayes rule.

8 Defensive forecasting for specialist experts
and the AA

In this section we will find a more explicit version of defensive forecasting in
the case of specialist experts. Our algorithm will achieve a slightly more general
version of the bound (7); namely, we will replace the $\ln N$ in (7) by $-\ln p^n$, where
$p^n$ is an a priori chosen weight for Expert $n$: all $p^n$ are non-negative and sum
to 1. Without loss of generality all $p^n$ will be assumed positive (our algorithm
can always be applied to the subset of experts with positive weights). Let $A_t$
be the set of awake experts at time $t$: $A_t := \{n \in \{1, \ldots, N\} \mid \gamma_t^n \ne a\}$.

Let $\lambda$ be an $\eta$-mixable loss function. By the definition of mixability there
exists a function $\Sigma(u_1, \ldots, u_k, \gamma_1, \ldots, \gamma_k)$ (called a substitution function) such
that:

• the domain of $\Sigma$ consists of all sequences $(u_1, \ldots, u_k, \gamma_1, \ldots, \gamma_k)$, for all
$k = 0, 1, 2, \ldots$, of numbers $u_i \in [0, 1]$ summing to 1, $u_1 + \cdots + u_k = 1$, and
predictions $\gamma_1, \ldots, \gamma_k \in [0, 1]$;

• $\Sigma$ takes values in the prediction space $[0, 1]$;

• for any $(u_1, \ldots, u_k, \gamma_1, \ldots, \gamma_k)$ in the domain of $\Sigma$, the prediction $\gamma :=
\Sigma(u_1, \ldots, u_k, \gamma_1, \ldots, \gamma_k)$ satisfies

$$\forall \omega \in \{0, 1\} : \quad e^{-\eta \lambda(\gamma, \omega)} \ge \sum_{i=1}^k e^{-\eta \lambda(\gamma_i, \omega)} u_i. \tag{15}$$

Fix such a function $\Sigma$. Notice that its value $\Sigma()$ on the empty sequence can be
chosen arbitrarily, that the case $k = 1$ is trivial, and that the case $k = 2$ in fact
covers the cases $k = 3$, $k = 4$, etc.

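Defensive forecasting algorithm for specialist experts

  $w_0^n := p^n$, $n = 1, \ldots, N$.
  FOR $t = 1, 2, \ldots$:
    Read the list $A_t$ of awake experts
      and their predictions $\gamma_t^n \in [0, 1]$, $n \in A_t$.
    Predict $\gamma_t := \Sigma\left( (u_{t-1}^n)_{n \in A_t}, (\gamma_t^n)_{n \in A_t} \right)$,
      where $u_{t-1}^n := w_{t-1}^n / \sum_{n \in A_t} w_{t-1}^n$.
    Read the outcome $\omega_t \in \{0, 1\}$.
    Set $w_t^n := w_{t-1}^n e^{\eta(\lambda(\gamma_t, \omega_t) - \lambda(\gamma_t^n, \omega_t))}$ for $n \in A_t$.
  END FOR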

This algorithm is a simple modification of the AA, and it becomes the AA when
the experts are always awake. In the case of the log loss function, this algorithm
was found by Freund et al.; in this special case, Freund et al. derive the same
performance guarantee as we do.
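A minimal sketch of the algorithm for specialist experts follows. It is written for the log loss with $\eta = 1$, where the weighted average of the awake experts' predictions satisfies (15) with equality and can therefore serve as the substitution function; for other mixable loss functions a different substitution function would have to be supplied. The names `ASLEEP`, `mixture`, and `run_specialists`, the use of `None` for the abstention symbol $a$, and the arbitrary prediction 0.5 when no expert is awake are assumptions of this illustration.

```python
import math

ASLEEP = None  # plays the role of the abstention symbol a


def log_loss(gamma, omega):
    p = gamma if omega == 1 else 1.0 - gamma
    return math.inf if p <= 0.0 else -math.log(p)


def mixture(weights, preds):
    """Substitution function for log loss with eta = 1: the weighted average."""
    return sum(u * g for u, g in zip(weights, preds))


def run_specialists(advice, outcomes, prior, eta=1.0,
                    loss=log_loss, substitution=mixture):
    """advice[t][n] is Expert n's prediction at step t, or ASLEEP.

    Returns (learner_losses, expert_losses), where entry n accumulates losses
    only over the steps at which Expert n is awake.
    """
    w = list(prior)  # w_0^n := p^n
    learner = [0.0] * len(prior)
    expert = [0.0] * len(prior)
    for preds, omega in zip(advice, outcomes):
        awake = [n for n, g in enumerate(preds) if g is not ASLEEP]
        if awake:
            total = sum(w[n] for n in awake)
            gamma = substitution([w[n] / total for n in awake],
                                 [preds[n] for n in awake])
        else:
            gamma = 0.5  # value on the empty sequence is arbitrary
        own = loss(gamma, omega)
        for n in awake:  # weights of sleeping experts are left unchanged
            theirs = loss(preds[n], omega)
            learner[n] += own
            expert[n] += theirs
            w[n] *= math.exp(eta * (own - theirs))
    return learner, expert
```

With a uniform prior `[1.0 / N] * N`, the losses returned for each expert should differ by at most $\ln N$, in agreement with the bound for specialist experts with $\eta = 1$.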
Derivation of the algorithm

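In this derivation we will need the following notation. For each history of the
game, let $A^n$, $n \in \{1, \ldots, N\}$, be the set of steps at which Expert $n$ is awake:

$$A^n := \{ t \in \{1, 2, \ldots\} \mid n \in A_t \}.$$

For each positive integer $k$, $[k]$ stands for the set $\{1, \ldots, k\}$.

The method of defensive forecasting requires (cf. Corollary 2) that at step
$T$ we should choose $\pi = \pi_T$ such that, for each $\omega \in \{0, 1\}$,

$$\sum_{n \in A_T} p^n e^{\eta(\lambda(\pi, \omega) - \lambda(\gamma_T^n, \omega))} \prod_{t \in [T-1] \cap A^n} e^{\eta(\lambda(\pi_t, \omega_t) - \lambda(\gamma_t^n, \omega_t))}$$
$$\qquad + \sum_{n \in A_T^c} p^n \prod_{t \in [T-1] \cap A^n} e^{\eta(\lambda(\pi_t, \omega_t) - \lambda(\gamma_t^n, \omega_t))}$$
$$\qquad \le \sum_{n \in [N]} p^n \prod_{t \in [T-1] \cap A^n} e^{\eta(\lambda(\pi_t, \omega_t) - \lambda(\gamma_t^n, \omega_t))},$$

where $A_T^c$ stands for the complement of $A_T$ in $[N]$: $A_T^c := [N] \setminus A_T$. This
inequality is equivalent to

$$\sum_{n \in A_T} p^n e^{\eta(\lambda(\pi, \omega) - \lambda(\gamma_T^n, \omega))} \prod_{t \in [T-1] \cap A^n} e^{\eta(\lambda(\pi_t, \omega_t) - \lambda(\gamma_t^n, \omega_t))}$$
$$\qquad \le \sum_{n \in A_T} p^n \prod_{t \in [T-1] \cap A^n} e^{\eta(\lambda(\pi_t, \omega_t) - \lambda(\gamma_t^n, \omega_t))}$$

and can be rewritten as

$$\sum_{n \in A_T} e^{\eta(\lambda(\pi, \omega) - \lambda(\gamma_T^n, \omega))} u_{T-1}^n \le 1, \tag{16}$$

where $u_{T-1}^n := w_{T-1}^n / \sum_{n \in A_T} w_{T-1}^n$ are the normalized weights

$$w_{T-1}^n := p^n \prod_{t \in [T-1] \cap A^n} e^{\eta(\lambda(\pi_t, \omega_t) - \lambda(\gamma_t^n, \omega_t))}.$$

Comparing (16) and (15), we can see that it suffices to set

$$\pi_T := \Sigma\left( (u_{T-1}^n)_{n \in A_T}, (\gamma_T^n)_{n \in A_T} \right).$$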

Discussion of the algorithm 

The main difference of the algorithm of the previous subsection from the AA
is in the way the experts' weights are updated. The weights of the sleeping
experts are not changed, whereas the weights of the awake experts are multiplied
by $e^{\eta(\lambda(\gamma_t, \omega_t) - \lambda(\gamma_t^n, \omega_t))}$. Therefore, Learner's loss serves as the benchmark: the
weight of an awake expert who performs better than Learner goes up, the weight
of an awake expert who performs worse than Learner goes down, and the weight
of a sleeping expert does not change.